{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Simple RAG (Retrieval-Augmented Generation) System\n",
    "\n",
    "## Overview\n",
    "\n",
    "This code implements a basic Retrieval-Augmented Generation (RAG) system for processing and querying PDF document(s). The system uses a pipeline that encodes the documents and creates nodes. These nodes then can be used to build a vector index to retrieve relevant information.\n",
    "\n",
    "## Key Components\n",
    "\n",
    "1. PDF processing and text extraction\n",
    "2. Text chunking for manageable processing\n",
    "3. Ingestion pipeline creation using FAISS as vector store and OpenAI embeddings\n",
    "4. Retriever setup for querying the processed documents\n",
    "5. Evaluation of the RAG system\n",
    "\n",
    "## Method Details\n",
    "\n",
    "### Document Preprocessing\n",
    "\n",
    "1. The PDF is loaded using [SimpleDirectoryReader](https://docs.llamaindex.ai/en/stable/module_guides/loading/simpledirectoryreader/).\n",
    "2. The text is split into [nodes/chunks](https://docs.llamaindex.ai/en/stable/module_guides/loading/documents_and_nodes/) using [SentenceSplitter](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules/#sentencesplitter) with specified chunk size and overlap.\n",
    "\n",
    "### Text Cleaning\n",
    "\n",
    "A custom transformation `TextCleaner` is applied to clean the texts. This likely addresses specific formatting issues in the PDF.\n",
    "\n",
    "### Ingestion Pipeline Creation\n",
    "\n",
    "1. OpenAI embeddings are used to create vector representations of the text nodes.\n",
    "2. A FAISS vector store is created from these embeddings for efficient similarity search.\n",
    "\n",
    "### Retriever Setup\n",
    "\n",
    "1. A retriever is configured to fetch the top 2 most relevant chunks for a given query.\n",
    "\n",
    "\n",
    "## Key Features\n",
    "\n",
    "1. Modular Design: The ingestion process is encapsulated in a single function for easy reuse.\n",
    "2. Configurable Chunking: Allows adjustment of chunk size and overlap.\n",
    "3. Efficient Retrieval: Uses FAISS for fast similarity search.\n",
    "4. Evaluation: Includes a function to evaluate the RAG system's performance.\n",
    "\n",
    "## Usage Example\n",
    "\n",
    "The code includes a test query: \"What is the main cause of climate change?\". This demonstrates how to use the retriever to fetch relevant context from the processed document.\n",
    "\n",
    "## Evaluation\n",
    "\n",
    "The system includes an `evaluate_rag` function to assess the performance of the retriever, though the specific metrics used are not detailed in the provided code.\n",
    "\n",
    "## Benefits of this Approach\n",
    "\n",
    "1. Scalability: Can handle large documents by processing them in chunks.\n",
    "2. Flexibility: Easy to adjust parameters like chunk size and number of retrieved results.\n",
    "3. Efficiency: Utilizes FAISS for fast similarity search in high-dimensional spaces.\n",
    "4. Integration with Advanced NLP: Uses OpenAI embeddings for state-of-the-art text representation.\n",
    "\n",
    "## Conclusion\n",
    "\n",
    "This simple RAG system provides a solid foundation for building more complex information retrieval and question-answering systems. By encoding document content into a searchable vector store, it enables efficient retrieval of relevant information in response to queries. This approach is particularly useful for applications requiring quick access to specific information within large documents or document collections."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Import libraries and environment variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import List\n",
    "from llama_index.core import SimpleDirectoryReader, VectorStoreIndex\n",
    "from llama_index.core.ingestion import IngestionPipeline\n",
    "from llama_index.core.schema import BaseNode, TransformComponent\n",
    "from llama_index.vector_stores.faiss import FaissVectorStore\n",
    "from llama_index.core.text_splitter import SentenceSplitter\n",
    "from llama_index.embeddings.openai import OpenAIEmbedding\n",
    "from llama_index.core import Settings\n",
    "import faiss\n",
    "import os\n",
    "import sys\n",
    "from dotenv import load_dotenv\n",
    "\n",
    "sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..'))) # Add the parent directory to the path sicnce we work with notebooks\n",
    "\n",
    "EMBED_DIMENSION = 512\n",
    "\n",
    "# Chunk settings are way different than langchain examples\n",
    "# Beacuse for the chunk length langchain uses length of the string,\n",
    "# while llamaindex uses length of the tokens\n",
    "CHUNK_SIZE = 200\n",
    "CHUNK_OVERLAP = 50\n",
    "\n",
    "# Load environment variables from a .env file\n",
    "load_dotenv()\n",
    "\n",
    "# Set the OpenAI API key environment variable\n",
    "os.environ[\"OPENAI_API_KEY\"] = os.getenv('OPENAI_API_KEY')\n",
    "\n",
    "# Set embeddig model on LlamaIndex global settings\n",
    "Settings.embed_model = OpenAIEmbedding(model=\"text-embedding-3-small\", dimensions=EMBED_DIMENSION)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Read Docs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "path = \"../data/\"\n",
    "node_parser = SimpleDirectoryReader(input_dir=path, required_exts=['.pdf'])\n",
    "documents = node_parser.load_data()\n",
    "print(documents[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Vector Store"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create FaisVectorStore to store embeddings\n",
    "faiss_index = faiss.IndexFlatL2(EMBED_DIMENSION)\n",
    "vector_store = FaissVectorStore(faiss_index=faiss_index)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Text Cleaner Transformation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "class TextCleaner(TransformComponent):\n",
    "    \"\"\"\n",
    "    Transformation to be used within the ingestion pipeline.\n",
    "    Cleans clutters from texts.\n",
    "    \"\"\"\n",
    "    def __call__(self, nodes, **kwargs) -> List[BaseNode]:\n",
    "        \n",
    "        for node in nodes:\n",
    "            node.text = node.text.replace('\\t', ' ') # Replace tabs with spaces\n",
    "            node.text = node.text.replace(' \\n', ' ') # Replace paragraph seperator with spacaes\n",
    "            \n",
    "        return nodes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Ingestion Pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "text_splitter = SentenceSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)\n",
    "\n",
    "# Create a pipeline with defined document transformations and vectorstore\n",
    "pipeline = IngestionPipeline(\n",
    "    transformations=[\n",
    "        TextCleaner(),\n",
    "        text_splitter,\n",
    "    ],\n",
    "    vector_store=vector_store, \n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Run pipeline and get generated nodes from the process\n",
    "nodes = pipeline.run(documents=documents)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create retriever"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "vector_store_index = VectorStoreIndex(nodes)\n",
    "retriever = vector_store_index.as_retriever(similarity_top_k=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Test retriever"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "def show_context(context):\n",
    "    \"\"\"\n",
    "    Display the contents of the provided context list.\n",
    "\n",
    "    Args:\n",
    "        context (list): A list of context items to be displayed.\n",
    "\n",
    "    Prints each context item in the list with a heading indicating its position.\n",
    "    \"\"\"\n",
    "    for i, c in enumerate(context):\n",
    "        print(f\"Context {i+1}:\")\n",
    "        print(c.text)\n",
    "        print(\"\\n\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Context 1:\n",
      "Chapter 2: Causes of Climate Change  Greenhouse Gases  The primary cause of recent climate change is the increase in greenhouse gases in the atmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous oxide (N2O), trap heat from the sun, creating a \"greenhouse effect.\" This effect is  essential for life on Earth, as it keeps the planet warm enough to support life. However, human activities have intensified this natural process, leading to a warmer climate.  Fossil Fuels  Burning fossil fuels for energy releases large amounts of CO2.\n",
      "\n",
      "\n",
      "Context 2:\n",
      "The Intergovernmental Panel on Climate Change (IPCC) has documented these changes extensively. Ice core samples, tree rings, and ocean sediments provide a historical record that scientists use to understand past climate conditions and predict future trends. The evidence overwhelmingly shows that recent changes are primarily driven by human activities, particularly the emission of greenhou se gases.  Chapter 2: Causes of Climate Change  Greenhouse Gases  The primary cause of recent climate change is the increase in greenhouse gases in the atmosphere.\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "test_query = \"What is the main cause of climate change?\"\n",
    "context = retriever.retrieve(test_query)\n",
    "show_context(context)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Let's see how well does it perform:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "from deepeval import evaluate\n",
    "from deepeval.metrics import GEval, FaithfulnessMetric, ContextualRelevancyMetric\n",
    "from deepeval.test_case import LLMTestCaseParams\n",
    "from evaluation.evalute_rag import create_deep_eval_test_cases\n",
    "\n",
    "# Set llm model for evaluation of the question and answers \n",
    "LLM_MODEL = \"gpt-4o\"\n",
    "\n",
    "# Define evaluation metrics\n",
    "correctness_metric = GEval(\n",
    "    name=\"Correctness\",\n",
    "    model=LLM_MODEL,\n",
    "    evaluation_params=[\n",
    "        LLMTestCaseParams.EXPECTED_OUTPUT,\n",
    "        LLMTestCaseParams.ACTUAL_OUTPUT\n",
    "    ],\n",
    "    evaluation_steps=[\n",
    "        \"Determine whether the actual output is factually correct based on the expected output.\"\n",
    "    ],\n",
    ")\n",
    "\n",
    "faithfulness_metric = FaithfulnessMetric(\n",
    "    threshold=0.7,\n",
    "    model=LLM_MODEL,\n",
    "    include_reason=False\n",
    ")\n",
    "\n",
    "relevance_metric = ContextualRelevancyMetric(\n",
    "    threshold=1,\n",
    "    model=LLM_MODEL,\n",
    "    include_reason=True\n",
    ")\n",
    "\n",
    "def evaluate_rag(query_engine, num_questions: int = 5) -> None:\n",
    "    \"\"\"\n",
    "    Evaluate the RAG system using predefined metrics.\n",
    "\n",
    "    Args:\n",
    "        query_engine: Query engine to ask questions and get answers along with retrieved context.\n",
    "        num_questions (int): Number of questions to evaluate (default: 5).\n",
    "    \"\"\"\n",
    "    \n",
    "    \n",
    "    # Load questions and answers from JSON file\n",
    "    q_a_file_name = \"../data/q_a.json\"\n",
    "    with open(q_a_file_name, \"r\", encoding=\"utf-8\") as json_file:\n",
    "        q_a = json.load(json_file)\n",
    "\n",
    "    questions = [qa[\"question\"] for qa in q_a][:num_questions]\n",
    "    ground_truth_answers = [qa[\"answer\"] for qa in q_a][:num_questions]\n",
    "    generated_answers = []\n",
    "    retrieved_documents = []\n",
    "\n",
    "    # Generate answers and retrieve documents for each question\n",
    "    for question in questions:\n",
    "        response = query_engine.query(question)\n",
    "        context = [doc.text for doc in response.source_nodes]\n",
    "        retrieved_documents.append(context)\n",
    "        generated_answers.append(response.response)\n",
    "\n",
    "    # Create test cases and evaluate\n",
    "    test_cases = create_deep_eval_test_cases(questions, ground_truth_answers, generated_answers, retrieved_documents)\n",
    "    evaluate(\n",
    "        test_cases=test_cases,\n",
    "        metrics=[correctness_metric, faithfulness_metric, relevance_metric]\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluate results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "query_engine  = vector_store_index.as_query_engine(similarity_top_k=2)\n",
    "evaluate_rag(query_engine, num_questions=1)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
